Ryan St.Pierre (ras70) Dakota Brinkman (dlb46) Matt Olson (meo8) Jeremy Schreck (jes85)
The biological question we are trying to answer is how robust metabolic cycles within yeast cells respond under continuous, glucose limited conditions. The two papers we were given try to correlate the chromatin states and this metabolic expression. We will first be focusing on the gene expression under these conditions, what this expression reveals about the function of the cell under these conditions, and finally what TFs control the given clusters and how their manner of control creates complex functionality. We will also try to confirm that the metabolic processes respond in an oscillatory fashion in which there is a “just in-time supply chain,” meaning the clusters should oscillate in a delayed fashion with respect to each other.
The scope of this project is to investigate the clusters of the data given and their expression patterns (done in this homework) and the transcription factor networks that control this oscillatory behavior (done in the next homework).
Data scope
In our clustering analysis we used most of the data given. We were given two expression sets: microarray and rnaSeq. We used all of the microarray data, since we noticed no irregularities, such as NaN or zero expression. In the rnaSeq we filtered the data slightly to remove genes from the dataset that had contant zero expression. This filtering is done within the code itself, not manually in the .csv file, and can be referenced in the code section below. We did notice that in both datasets some of the genes are included twice. We considered removing these duplications. However, we noticed that even though the genes were repeated their expression over time was not the same across duplicated. For this reason we kept these genes in the dataset.
The sure size of the dataset did create some complications. Particuarly, we created large clusters, which made generating heatmaps for each cluster too computationally intensive to complete in a timely manner. We did consider taking a subset of the original data for this reason. However, we could not come up with a sensable way to cut the size of the data down. We briefly considered cutting the size of the data down based on gene expression variance (removing the least variant genes). However, we decided that this really had no true justification, especially in the context of the findings in our primary papers, since gene expression variance isn’t correlated to the particular clustering of genes necessarily. Our primary papers found large “super-clusters” and we did not want to eliminate this result by improperly taking a subset of the data. Thus, as a compromise we decided to visualize sub-sets of clusters with heatmaps once the clusters were already generated.
Before starting clustering analysis we first wanted to import the data and manipulate it into a more usable form. The original data, both for the rnaSeq and microarray, was given with genes as rows and time as columns. However, all clustering techniques we know involve clustering columns, not rows. Thus, we had to take the transpose of the original data. Additionally, the original columns corresponded to time. However, the absolute time in each column was not given. Thus, we incorporated this time into the data for certain uses, such as time plots.
Libraries Needed
kuang <- read.csv("./datasets/kuang-2014-microarray-expression.csv") ##local- needs to be changed
time <- read.csv("./datasets/kuang-2014-microarry-expression-timepts.csv")
head(kuang[0:5]) ##preview data sets
tail(kuang[0:5])
rnaseq <- read.csv("./datasets/kuang-2014-rnaseq-expression.csv") #local needs to be changed
rnatime <- read.csv("./datasets/kuang-2014-rnaseq-expression-timepts.csv")
dim(kuang) ##Get a sense of the size
[1] 6241 37
dim(rnaseq)
[1] 6241 17
head(time)
head(rnatime)
Using the dim command we were able to get a sense of the differences between the microarray and rnaSeq data. Given the dimensions they both contain the same amount of genes (6241). However, the rnaSeq has less time data points.
Also, by previewing the different time data sets we concluded that the microarray expression is sampled every 23 minutes, while the rnaSeq expression is sampled every 0.31 hours or 18.6 minutes. This led us to conclude that the rnaSeq has a higher sampling rate. We thought that this may allow us to see more granular changes in expression oscillation over time in the rnaSeq data.
# Transpose of kuang saved as kuang_t
n <- kuang$Gene.name # first remember the names
# transpose all but the first column (Gene.name)
kuang_t <- as.data.frame(t(kuang[,-1]))
colnames(kuang_t) <- n
dim(kuang_t)
[1] 36 6241
dim(time)
[1] 36 1
kuang_t.time <- cbind(kuang_t, time = time$minute)
head(kuang_t.time[6241:6242]) ##preview end of transpose dataframe to ensure the work was done correctly
colnames(kuang_t) %>% length()
[1] 6241
unique.genes <- colnames(kuang_t) %>% unique()
length(unique.genes)
[1] 6035
While looking at the data set we observed multiple genes that were present more than once. Using the unique function we can see that there are 6241 genes in the data set, but only 6035 unique genes. We are not sure for the reasoning behind this, our best guess is redundancy, but it is something to keep in mind going forward.
At this point we were still trying to get familiar with our dataset. Thus, we generated time plots of the first 10 genes in the dataset.
#visual of first 10 genes
first.10.genes <- c("YAL001C", "YAL002W", "YAL003W", "YAL004W", "YAL005C", "YAL007C", "YAL008W", "YAL009W", "YAL010C", "YAL011W")
ggplot(kuang_t.time, aes(x = time, y = YAL001C, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL001C")
ggplot(kuang_t.time, aes(x = time, y = YAL002W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL002W")
ggplot(kuang_t.time, aes(x = time, y = YAL003W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL003W")
ggplot(kuang_t.time, aes(x = time, y = YAL004W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL004W")
ggplot(kuang_t.time, aes(x = time, y = YAL005C, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL005C")
ggplot(kuang_t.time, aes(x = time, y = YAL007C, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL007C")
ggplot(kuang_t.time, aes(x = time, y = YAL008W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL008W")
ggplot(kuang_t.time, aes(x = time, y = YAL009W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL009W")
ggplot(kuang_t.time, aes(x = time, y = YAL010C, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL010C")
ggplot(kuang_t.time, aes(x = time, y = YAL011W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL011W")
It is clear from these plots that all the genes show an oscillating pattern of expression. However, these ten genes don’t necessarily have oscillations that are similar over time.
Complete method, with 1-Pearson correlation as the distance matrix
First, we attempted hierarchical clustering using the complete method with 1-Pearson’s correlation as the distance metric. The first thing we did was create the correlation matrix and then distance matrix of the transpose of the original data (with genes and the columns). We are clustered the transpose of the original matrix because we wanted to cluster genes based on their relative correlation over time. All of the clustering techniques we know cluster the columns, not rows, of the dataset.
## Getting unique names for all the columns
genes_used <- c()
for (i in 1:length(kuang_t)) {
if (names(kuang_t)[i] %in% genes_used) {
names(kuang_t)[i]<-paste(names(kuang_t)[i],i)
}
genes_used <- c(genes_used,names(kuang_t)[i])
}
## These take a lot of time so they are moved into their own code segment
kuang_t.cor <- kuang_t %>% cor(use="pairwise.complete.obs") ##generate correlation measurements
kuang.dist <- as.dist(1 - kuang_t.cor) ##use correlation to find distance measurements
kuang.tree <- hclust(kuang.dist, method="complete")
kuang.dend <- as.dendrogram(kuang.tree) ## created & priviewed dendrogram object
To get an idea of the data we plotted the dendrogram. From the dendrogram it is clear that there are significant clusters in the dataset.
plot(kuang.dend, leaflab = "none") ## removing labels
To further get a feel for the dataset we plotted the dendrogram at several different cluster values (2, 4, and 8)- color coded to best analyze the data.
clusters_2 <- cutree(kuang.dend, k=2)
plot(color_branches(kuang.dend, k=2),leaflab="none")
table(clusters_2)
clusters_2
1 2
2956 3285
clusters_4 <- cutree(kuang.dend, k=4)
plot(color_branches(kuang.dend, k=4),leaflab="none")
table(clusters_4)
clusters_4
1 2 3 4
2253 703 2313 972
clusters_8 <- cutree(kuang.dend, k=8)
plot(color_branches(kuang.dend, k=8),leaflab="none")
table(clusters_8)
clusters_8
1 2 3 4 5 6 7 8
1535 703 404 718 1577 764 208 332
After reading our paper, we saw that they found three major clusters. We next did hierarchical clustering using k = 3 to see if our clusters align with the clusters found in the paper.
clusters_3 <- cutree(kuang.dend, k=3)
plot(color_branches(kuang.dend, k=3),leaflab="none", main = "Original Microarray dendrogram with k = 3")
table(clusters_3)
clusters_3
1 2 3
2956 2313 972
We then cut the tree to extract these three clusters.
cut1 <- cut(kuang.dend, h = 1.9)
cut1$lower
[[1]]
'dendrogram' with 2 branches and 2956 members total, at height 1.876589
[[2]]
'dendrogram' with 2 branches and 972 members total, at height 1.777684
[[3]]
'dendrogram' with 2 branches and 2313 members total, at height 1.863976
The first cluster countains 2956 genes, the second contains 972 genes, and the third contains 2313 genes.
To get a better idea we performed hierarchical clustering, starting with the first cluster on the left (pink)
plot(cut1$lower[[1]], leaflab="none", main = "Cluster 1 dendrogram")
We then tried plotting this at dendrogram at different k values to see what looks good.
plot(color_branches(cut1$lower[[1]], k=3), leaflab="none", main = "Cluster 1 dendrogram with k = 3")
The first two clusters (pink and green) looked pretty compact to us, so we did not explore them anymore. The third (blue) cluster appeared to have some significant subclusters in it, so we explored it more.
cluster1.cut1 <- cut(cut1$lower[[1]], h = 1.75)
cluster1.cut1$lower
[[1]]
'dendrogram' with 2 branches and 703 members total, at height 1.6383
[[2]]
'dendrogram' with 2 branches and 718 members total, at height 1.708666
[[3]]
'dendrogram' with 2 branches and 1535 members total, at height 1.7321
plot(cluster1.cut1$lower[[3]], leaflab = "none")
We saw that there appeared to be 4 main clusters. Adding color allowed us to visualize if our observation held true.
plot(color_branches(cluster1.cut1$lower[[3]], k=4), leaflab="none", main = "Cluster 1 subtree dendrogram with k = 4")
That looked good to us. We then extracted these four clusters to see how large they were and what genes they contained.
cluster1.cut1.cutAgain <- cut(cluster1.cut1$lower[[3]], h = 1.4)
cluster1.cut1.cutAgain$lower
[[1]]
'dendrogram' with 2 branches and 312 members total, at height 1.361629
[[2]]
'dendrogram' with 2 branches and 479 members total, at height 1.360901
[[3]]
'dendrogram' with 2 branches and 342 members total, at height 1.138894
[[4]]
'dendrogram' with 2 branches and 402 members total, at height 1.306937
Next we looked at the smallest cluster ([[2]]).
plot(color_branches(cut1$lower[[2]], k=2), leaflab="none", main = "Cluster 2 dendrogram with k = 2")
Looking at this we saw there were two main clusters. We then cut this subtree again in order to extract these two subclusters.
cut1.cut2 <- cut(cut1$lower[[2]], h = 1.6)
cut1.cut2$lower
[[1]]
'dendrogram' with 2 branches and 764 members total, at height 1.413171
[[2]]
'dendrogram' with 2 branches and 208 members total, at height 1.508391
The smaller cluster (green) had 208 genes in it. This is a good size to work with, so we continued exploring this subcluster. First we plotted it.
plot(cut1.cut2$lower[[2]], leaflab = "none", main = "Subcluster of Cluster 2")
4 main clusters stuck out to us when looking at the above plot, so we plotted with color and k = 4.
plot(color_branches(cut1.cut2$lower[[2]], k=4), leaflab="none", main = "Subcluster of Cluster 2 dendrogram with k = 4")
We then extracted these four clusters.
The sizes looked reasonable to work with.
Lastly we analyzed the third ([[3]]) cluster (blue in the original dendrogram).
plot(cut1$lower[[3]], leaflab="none", main = "Cluster 3 dendrogram")
On this dendrogram we tried plotting it with different k values to see which made the most sense.
plot(color_branches(cut1$lower[[3]], k=5), leaflab="none", main = "Cluster 3 dendrogram with k = 5")
A k value of 5 looked good. As we have before, we then extracted these clusters by cutting the dendrogram.
cluster3.cut1 <- cut(cut1$lower[[3]], h = 1.6)
cluster3.cut1$lower
[[1]]
'dendrogram' with 2 branches and 332 members total, at height 1.565022
[[2]]
'dendrogram' with 2 branches and 404 members total, at height 1.519815
[[3]]
'dendrogram' with 2 branches and 141 members total, at height 1.563054
[[4]]
'dendrogram' with 2 branches and 904 members total, at height 1.566672
[[5]]
'dendrogram' with 2 branches and 532 members total, at height 1.5888
Given our hierarchical clustering we found the following clusters:
However, at this point we were unsure if the number of clusters we found was “correct”. Specifically we were unsure if the sub-clustering work we did of original clusters was significant. To explore these questions we tried investigating the data further, shown below.
We wanted to figure out how many clusters to choose for this assignment. In the Kuang paper, they chose 3 main clusters of 537, 1608, and 980. Below is a plot showing how the number of clusters relates to the average cluster size.
plot(c(
mean(table(cutree(kuang.dend, k=1))),
mean(table(cutree(kuang.dend, k=2))),
mean(table(cutree(kuang.dend, k=3))),
mean(table(cutree(kuang.dend, k=4))),
mean(table(cutree(kuang.dend, k=5))),
mean(table(cutree(kuang.dend, k=6))),
mean(table(cutree(kuang.dend, k=7))),
mean(table(cutree(kuang.dend, k=8))),
mean(table(cutree(kuang.dend, k=9))),
mean(table(cutree(kuang.dend, k=10))),
mean(table(cutree(kuang.dend, k=11))),
mean(table(cutree(kuang.dend, k=12))),
mean(table(cutree(kuang.dend, k=13))),
mean(table(cutree(kuang.dend, k=14))),
mean(table(cutree(kuang.dend, k=15)))), xlab="Num Clusters", ylab="Average size of Cluster")
From the plot above, we can see that as the number of clusters increases, the average size of the clusters decreases with a 1/x relationship. From our prior knowledge of clustering and machine learning, the cluster size at the “elbow” is the best choice because it minimizes overfitting while still representing the data appropriately. The ideal number of clusters from this analysis is between 3 and 10. This graph seems to support our previous findings, but may suggest that 10 may be on the high end of potential number of clusters in the data.
When examing what clusters were best and how many clusters we should have we thought it might be useful to look at the distributions of correlations within clusters.
Let’s first look at the correlation distribitions of the original 3 (top 3) clusters we made.
cluster1.genes <- as.character(labels(cut1$lower[[1]]))
cluster2.genes <- as.character(labels(cut1$lower[[2]]))
cluster3.genes <- as.character(labels(cut1$lower[[3]]))
kuang_t.cor1 <- kuang_t.cor[cluster1.genes,cluster1.genes]
kuang_t.cor2 <- kuang_t.cor[cluster2.genes,cluster2.genes]
kuang_t.cor3 <- kuang_t.cor[cluster3.genes,cluster3.genes]
f1 <- upper.tri(kuang_t.cor1, diag=FALSE)
f2 <- upper.tri(kuang_t.cor2, diag=FALSE)
f3 <- upper.tri(kuang_t.cor3, diag=FALSE)
df1 <- data.frame(corr = as.vector(kuang_t.cor1[f1]))
df2 <- data.frame(corr = as.vector(kuang_t.cor2[f2]))
df3 <- data.frame(corr = as.vector(kuang_t.cor3[f3]))
Histogram <- rbind(df1, df2, df3)
ggplot(Histogram, aes(corr)) +
geom_histogram(data = df1, fill = "red", alpha = 0.2, bins = 30) +
geom_histogram(data = df2, fill = "green", alpha = 0.2, bins = 30) +
geom_histogram(data = df3, fill = "blue", alpha = 0.2, bins = 30)
In the above plot the distribution of correlation for the top 3 clusters (before subclustering). From this plot we concluded that cluster 2 was the best cluster out of the three, since more of its elements have a correlation near 1 to other members in the group. From this visulization it was also made clear that cluster 2 had less members than the other two clusters, which helped to describe why its distribution was tighter.
Although the figure above is a nice visualization, it is not too enlightening. It told us behavior about the first three clusters, but not necessarily if we should sub-cluster further. Thus, we thought it would be helpful to plot the distribution of some clusters with their sub-clusters, to see if sub-clustering one more level improved the correlation distribution significantly.
cluster1.genes <- as.character(labels(cut1$lower[[1]]))
cluster1.cut1.genes1 <- as.character(labels(cluster1.cut1$lower[[1]]))
cluster1.cut1.genes2 <- as.character(labels(cluster1.cut1$lower[[2]]))
cluster1.cut1.genes3 <- as.character(labels(cluster1.cut1$lower[[3]]))
kuang_t.cor1 <- kuang_t.cor[cluster1.genes,cluster1.genes]
kuang_t.cor1.1 <- kuang_t.cor[cluster1.cut1.genes1,cluster1.cut1.genes1]
kuang_t.cor1.2 <- kuang_t.cor[cluster1.cut1.genes2,cluster1.cut1.genes2]
kuang_t.cor1.3 <- kuang_t.cor[cluster1.cut1.genes3,cluster1.cut1.genes3]
f1 <- upper.tri(kuang_t.cor1, diag=FALSE)
f1.1 <- upper.tri(kuang_t.cor1.1, diag=FALSE)
f1.2 <- upper.tri(kuang_t.cor1.2, diag=FALSE)
f1.3 <- upper.tri(kuang_t.cor1.3, diag=FALSE)
df1 <- data.frame(corr = as.vector(kuang_t.cor1[f1]))
df1.1 <- data.frame(corr = as.vector(kuang_t.cor1.1[f1.1]))
df1.2 <- data.frame(corr = as.vector(kuang_t.cor1.2[f1.2]))
df1.3 <- data.frame(corr = as.vector(kuang_t.cor1.3[f1.3]))
Histogram <- rbind(df1, df1.1, df1.2, df1.3)
ggplot(Histogram, aes(corr)) +
geom_histogram(data = df1, fill = "black", alpha = 0.05, bins = 30) +
geom_histogram(data = df1.1, fill = "blue", alpha = 0.2, bins = 30) +
geom_histogram(data = df1.2, fill = "green", alpha = 0.2, bins = 30) +
geom_histogram(data = df1.3, fill = "red", alpha = 0.2, bins = 30)
The above figure shows the correlation distribution in the first original cluster (faint gray). Additionally, it shows the distribution of the three sub-clusters that we formed from this original cluster. The correlation of the original cluster appeared to be a bell curve, centered around 0.25. However, the sub-clusters looked more like bell-curves centered around 0.3-0.5, with a tighter overal distribution. We found from these results that the subclustering of the original first cluster into three is likely significant, given the shift in distribution by a significant amount.
The most significant of these shifts occurred on the third sub-cluster. In our previous Hierarchical clustering we further sub-clustered this third cluster. However, given these results it is likely that this sub-clustering was not needed.
We repeated this process with the second original cluster, which was further clustered into two subclusters in our previous Hierarchical clustering.
cluster2.genes <- as.character(labels(cut1$lower[[2]]))
cut1.cut2.genes1 <- as.character(labels(cut1.cut2$lower[[1]]))
cut1.cut2.genes2 <- as.character(labels(cut1.cut2$lower[[2]]))
kuang_t.cor2 <- kuang_t.cor[cluster2.genes,cluster2.genes]
kuang_t.cor2.1 <- kuang_t.cor[cut1.cut2.genes1,cut1.cut2.genes1]
kuang_t.cor2.2 <- kuang_t.cor[cut1.cut2.genes2,cut1.cut2.genes2]
f2 <- upper.tri(kuang_t.cor2, diag=FALSE)
f2.1 <- upper.tri(kuang_t.cor2.1, diag=FALSE)
f2.2 <- upper.tri(kuang_t.cor2.2, diag=FALSE)
df2 <- data.frame(corr = as.vector(kuang_t.cor2[f2]))
df2.1 <- data.frame(corr = as.vector(kuang_t.cor2.1[f2.1]))
df2.2 <- data.frame(corr = as.vector(kuang_t.cor2.2[f2.2]))
Histogram <- rbind(df2, df2.1, df2.2)
ggplot(Histogram, aes(corr)) +
geom_histogram(data = df2, fill = "black", alpha = 0.1, bins = 30) +
geom_histogram(data = df2.1, fill = "blue", alpha = 0.2, bins = 30) +
geom_histogram(data = df2.2, fill = "red", alpha = 0.2, bins = 30)
The first observation made from the above figure was that the original cluster 2 (faint gray) was not a good cluster. A lot of its values have negative correlation with other genes in its cluster, evident by the genes to the left of the zero correlation mark. Conversely, the sub-clusters did not have this problem, as they both have much tighter distributions that had little genes with negative correlation values. From this figure we concluded that the further sub-clustering of the original second cluster was justified.
Lastly, we repeated this for the original third cluster.
cluster3.genes <- as.character(labels(cut1$lower[[3]]))
cluster3.cut1.genes1 <- as.character(labels(cluster3.cut1$lower[[1]]))
cluster3.cut1.genes2 <- as.character(labels(cluster3.cut1$lower[[2]]))
cluster3.cut1.genes3 <- as.character(labels(cluster3.cut1$lower[[3]]))
cluster3.cut1.genes4 <- as.character(labels(cluster3.cut1$lower[[4]]))
cluster3.cut1.genes5 <- as.character(labels(cluster3.cut1$lower[[5]]))
kuang_t.cor3 <- kuang_t.cor[cluster3.genes,cluster3.genes]
kuang_t.cor3.1 <- kuang_t.cor[cluster3.cut1.genes1,cluster3.cut1.genes1]
kuang_t.cor3.2 <- kuang_t.cor[cluster3.cut1.genes2,cluster3.cut1.genes2]
kuang_t.cor3.3 <- kuang_t.cor[cluster3.cut1.genes1,cluster3.cut1.genes3]
kuang_t.cor3.4 <- kuang_t.cor[cluster3.cut1.genes2,cluster3.cut1.genes4]
kuang_t.cor3.5 <- kuang_t.cor[cluster3.cut1.genes1,cluster3.cut1.genes5]
f3 <- upper.tri(kuang_t.cor3, diag=FALSE)
f3.1 <- upper.tri(kuang_t.cor3.1, diag=FALSE)
f3.2 <- upper.tri(kuang_t.cor3.2, diag=FALSE)
f3.3 <- upper.tri(kuang_t.cor3.3, diag=FALSE)
f3.4 <- upper.tri(kuang_t.cor3.4, diag=FALSE)
f3.5 <- upper.tri(kuang_t.cor3.5, diag=FALSE)
df3 <- data.frame(corr = as.vector(kuang_t.cor3[f3]))
df3.1 <- data.frame(corr = as.vector(kuang_t.cor3.1[f3.1]))
df3.2 <- data.frame(corr = as.vector(kuang_t.cor3.2[f3.2]))
df3.3 <- data.frame(corr = as.vector(kuang_t.cor3.3[f3.3]))
df3.4 <- data.frame(corr = as.vector(kuang_t.cor3.4[f3.4]))
df3.5 <- data.frame(corr = as.vector(kuang_t.cor3.5[f3.5]))
Histogram <- rbind(df3, df3.1, df3.2, df3.3, df3.4, df3.5)
ggplot(Histogram, aes(corr)) +
geom_histogram(data = df3, fill = "black", alpha = 0.1, bins = 30) +
geom_histogram(data = df3.1, fill = "blue", alpha = 0.2, bins = 30) +
geom_histogram(data = df3.2, fill = "red", alpha = 0.2, bins = 30) +
geom_histogram(data = df3.3, fill = "yellow", alpha = 0.2, bins = 30) +
geom_histogram(data = df3.4, fill = "purple", alpha = 0.2, bins = 30) +
geom_histogram(data = df3.5, fill = "pink", alpha = 0.2, bins = 30)
Again, we found that the further sub-clustering produced a substatial improvement in the clustering of the data.
Our final hierarchical clusters, based on the work shown above (average cluster size vs number of clusters, hierarchical and correlation distributions), are given below.
nleaves(cluster1.cut1$lower[[1]])
[1] 703
nleaves(cluster1.cut1$lower[[2]])
[1] 718
nleaves(cluster1.cut1$lower[[3]])
[1] 1535
nleaves(cut1.cut2$lower[[1]])
[1] 764
nleaves(cut1.cut2$lower[[2]])
[1] 208
nleaves(cluster3.cut1$lower[[1]])
[1] 332
nleaves(cluster3.cut1$lower[[2]])
[1] 404
nleaves(cluster3.cut1$lower[[3]])
[1] 141
nleaves(cluster3.cut1$lower[[4]])
[1] 904
nleaves(cluster3.cut1$lower[[5]])
[1] 532
As a sanity check we made sure we didn’t lose any genes.
nleaves(cluster1.cut1$lower[[1]]) +
nleaves(cluster1.cut1$lower[[2]]) +
nleaves(cluster1.cut1$lower[[3]]) +
nleaves(cut1.cut2$lower[[1]]) +
nleaves(cut1.cut2$lower[[2]]) +
nleaves(cluster3.cut1$lower[[1]]) +
nleaves(cluster3.cut1$lower[[2]]) +
nleaves(cluster3.cut1$lower[[3]]) +
nleaves(cluster3.cut1$lower[[4]]) +
nleaves(cluster3.cut1$lower[[5]])
[1] 6241
The total number of leaves matches the original number of genes in the dataset.
Given our final clusters we generated some heatmaps to show the correlation between genes within clusters. In most cases, the number of genes in a given cluster was too large to create one heatmap. Thus, often we were forced to create heatmaps with certain subsets of clusters. Below are a few of the heat maps we decided to look at:
color.scheme <- rev(brewer.pal(8,"RdBu"))
kuang_t.matrix <- data.matrix(kuang_t, rownames.force = NA)
heatmap.2(t(kuang_t.matrix[,as.character(labels(cluster3.cut1$lower[[3]]))]),
Rowv = NULL,
Colv = NULL,
dendrogram = "none",
breaks = seq(-2, 2, length.out = 9),
col = color.scheme,
trace = "none", density.info = "none",cexRow=0.1,
main = "cluster3.cut1$lower[[3]]"
)
heatmap.2(t(kuang_t.matrix[,as.character(labels(cluster1.cut1$lower[[3]]))]),
Rowv = NULL,
Colv = NULL,
dendrogram = "none",
breaks = seq(-2, 2, length.out = 9),
col = color.scheme,
trace = "none", density.info = "none",cexRow=0.1,
main = "cluster1.cut1$lower[[3]]"
)
heatmap.2(t(kuang_t.matrix[,as.character(labels(cut1.cut2$lower[[1]]))]),
Rowv = NULL,
Colv = NULL,
dendrogram = "none",
breaks = seq(-2, 2, length.out = 9),
col = color.scheme,
trace = "none", density.info = "none",cexRow=0.1,
main = "cut1.cut2$lower[[1]]"
)
heatmap.2(t(kuang_t.matrix[,as.character(labels(cluster3.cut1$lower[[4]]))]),
Rowv = NULL,
Colv = NULL,
dendrogram = "none",
breaks = seq(-2, 2, length.out = 9),
col = color.scheme,
trace = "none", density.info = "none",cexRow=0.1,
main = "cluster3.cut1$lower[[4]]"
)
We were satisfied with the results of our heatmaps as they seemed to suggest similar expression within clusters over time. The x axis for these heat maps was the microarray number (which corresponds to time). It is clear that the expression of these genes move in oscillatory patterns, which confirms the findings from the paper. Furthermore, the genes in a specific cluster behave very similarly.
Each of the cluster express vertical strips of blue, then red, then blue, which suggest the oscillatory patterns. Even more significantly between the clusters these strips seem to be offset, suggesting that there is a delay between the clusters’ oscillatory patterns. This supports the claim of the paper that there is a just in time supply chain in the function of the metabolic cycle.
At the beginning of the paper we generated time plots of a random sample of genes in the dataset. In this section we repeated the process for genes within the same cluster.
First, we converted the time data to long format.
dim(kuang_t.time)
[1] 36 6242
kuang_t.time.long <- gather(kuang_t.time, gene, expression, -time) ##This produces a dataframe with 209 columns due to duplicated column names
cols <- c("time", "gene", "expression")
kuang_t.time.long.fitered <- kuang_t.time.long[,cols]
dim(kuang_t.time.long.fitered)
[1] 224676 3
head(kuang_t.time.long.fitered)
Next, we grabbed the gene names from the first final cluster and selected the first five genes.
clusterf1.genes <- as.character(labels(cluster1.cut1$lower[[1]]))
clusterf1.genes.first5 <- clusterf1.genes[1:5]
Next, we filtered the kuang dataset for genes in this first cluster and plotted them on the same figure.
kuang_t.time.long.fitered %>%
filter(gene %in% clusterf1.genes.first5) %>%
ggplot(aes(x = time, y = expression, color = gene)) + geom_line()
From this plot it is clear the first couple of genes in this cluster have similar expression over time. This confirms the findings from the expression heat maps.
We repeated this step with our other clusters with similar results.
clusterf2.genes <- as.character(labels(cluster1.cut1$lower[[2]]))
clusterf2.genes.first5 <- clusterf2.genes[1:5]
kuang_t.time.long.fitered %>%
filter(gene %in% clusterf2.genes.first5) %>%
ggplot(aes(x = time, y = expression, color = gene)) + geom_line()
clusterf3.genes <- as.character(labels(cluster1.cut1$lower[[3]]))
clusterf3.genes.first5 <- clusterf3.genes[1:5]
kuang_t.time.long.fitered %>%
filter(gene %in% clusterf3.genes.first5) %>%
ggplot(aes(x = time, y = expression, color = gene)) + geom_line()
The above plots support the conclusion that the genes within the clusters chosen express similar oscillatory behavior.
Next, we wanted to explore how different clusters were expressed over time with relation to one another in an attempt to investigate the just in time supply chain conclusion from the paper. First, we chose the three subclusters that were created from the original first cluster.
cluster1 <- kuang_t.time.long.fitered %>%
filter(gene %in% clusterf1.genes.first5) %>%
mutate(cluster = 1)
cluster2 <- kuang_t.time.long.fitered %>%
filter(gene %in% clusterf2.genes.first5) %>%
mutate(cluster = 2)
cluster3 <- kuang_t.time.long.fitered %>%
filter(gene %in% clusterf3.genes.first5) %>%
mutate(cluster = 3)
combined.df <- rbind(cluster1, cluster2, cluster3)
ggplot(combined.df, aes(x = time, y = expression,
color = cluster, group = gene)) +
geom_line(alpha=0.5)
The above figure is the result of plotting the expression of the three subclusters of the first cluster from the original clustering over time. There are three different clusters depicted in this figure. It is clear from the figure that these three clusters have relatively the same expression pattern. Specifically, they peak and reach their minimums at the same time. It is important to note that only the first 5 genes from these clusters were used for clarity.
Also, this may work to suggest that even though further sub-clustering the original first cluster may have improved the correlation distribution in a meaningful way, it may not have significantly improved the synchrony of oscillation within the clusters.
Next, we investigated the gene expression difference between clusters. Specifically, we chose the first sub-cluster from the first original cluster (from hierarchical clustering) and the second sub-cluster from the second original cluster. This is shown below.
cluster1.1 <- kuang_t.time.long.fitered %>%
filter(gene %in% clusterf1.genes.first5) %>%
mutate(cluster = 1)
clusterf2.1.genes <- as.character(labels(cut1.cut2$lower[[2]]))
clusterf2.1.genes.first5 <- clusterf2.1.genes[1:10]
cluster2.1 <- kuang_t.time.long.fitered %>%
filter(gene %in% clusterf2.1.genes.first5) %>%
mutate(cluster = 2)
combined.df <- rbind(cluster1.1, cluster2.1)
ggplot(combined.df, aes(x = time, y = expression,
color = cluster, group = gene)) +
geom_line(alpha=0.5)
Unlike the three sub-clusters from the same parent cluster, the sub-clusters with different parents produced functions that were not in phase. This is easily seen as the two functions peak at different times. This further suggests that clusters exhibit the same oscillatory tendancies, but each cluster may be delayed or out of phase with others. Again, this supports the just in time chain.
In this section we carried out hierarchical clustering with the rnaSeq data similar to how it was performed with the microarray data.
# first remember the names
nrna <- rnaseq$Gene.name
# transpose all but the first column (Gene.name)
rnaseq_t_zeros <- as.data.frame(t(rnaseq[,-1]))
colnames(rnaseq_t_zeros) <- nrna
#dim(rnaseq_t_zeros)
#dim(rnatime)
#tail(rnaseq_t_zeros[0:5])
rnaseq_t_zeros.time <- cbind(rnaseq_t_zeros, time = rnatime$hour)
#head(rnaseq_t_zeros.time[6241:6242])
#visual of first 10 genes
first.10.genes <- c("YAL001C", "YAL002W", "YAL003W", "YAL004W", "YAL005C", "YAL007C", "YAL008W", "YAL009W", "YAL010C", "YAL011W")
ggplot(rnaseq_t_zeros.time, aes(x = time, y = YAL001C, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL001C")
ggplot(rnaseq_t_zeros.time, aes(x = time, y = YAL002W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL002W")
ggplot(rnaseq_t_zeros.time, aes(x = time, y = YAL003W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL003W")
ggplot(rnaseq_t_zeros.time, aes(x = time, y = YAL004W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL004W")
ggplot(rnaseq_t_zeros.time, aes(x = time, y = YAL005C, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL005C")
ggplot(rnaseq_t_zeros.time, aes(x = time, y = YAL007C, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL007C")
ggplot(rnaseq_t_zeros.time, aes(x = time, y = YAL008W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL008W")
ggplot(rnaseq_t_zeros.time, aes(x = time, y = YAL009W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL009W")
ggplot(rnaseq_t_zeros.time, aes(x = time, y = YAL010C, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL010C")
ggplot(rnaseq_t_zeros.time, aes(x = time, y = YAL011W, group = 1)) + geom_line() + labs(x = "Time (minutes)", y = "Expression of Gene", title = "YAL011W")
From these time plots we made a meaningful conclusion about the rnaSeq data in comparrison to the microarry: the rnaSeq is sampled higher, but over a shorter period of time, thus it appears that one cycle of oscillation expression is recorded, rather than several.
dim(rnaseq_t_zeros)
[1] 16 6241
head(rnaseq_t_zeros[,6240:6241])
Some of these genes have zero expression and need to be removed!
head(rnaseq_t_zeros[, colSums(rnaseq_t_zeros == 0) > 0])
rnaseq_t <- rnaseq_t_zeros[, colSums(rnaseq_t_zeros != 0) > 0]
head(rnaseq_t)
rnaseq_t.time <- cbind(rnaseq_t, time = rnatime$hour)
rnaseq.cor <- cor(rnaseq_t, use = "pairwise.complete.obs")
rnaseq.dist <- as.dist(1-rnaseq.cor)
rnaseq.tree <- hclust(rnaseq.dist, method = "complete")
rnaseq.dend <- as.dendrogram(rnaseq.tree)
plot(rnaseq.dend, leaflab = "none", main = "Original RNAdeq dendrogram")
Compared to the microarray dendrogram, it looked like these clusters are more compact and therefore stronger. We predicted this based off of the lower heights of most clusters. In order to see how true this was, we explored the clusters more.
plot(color_branches(rnaseq.dend, k=4),leaflab="none", main = "Original RNAseq dendrogram with k = 4")
rnaseq.clusters <- cut(rnaseq.dend, h = 1.9)
rnaseq.clusters$lower
[[1]]
'dendrogram' with 2 branches and 939 members total, at height 1.637243
[[2]]
'dendrogram' with 2 branches and 1314 members total, at height 1.896336
[[3]]
'dendrogram' with 2 branches and 2089 members total, at height 1.561592
[[4]]
'dendrogram' with 2 branches and 1849 members total, at height 1.864025
Time to go through these one by one. We started with the first cluster (pink).
plot(rnaseq.clusters$lower[[1]], leaflab = "none", main = "RNAseq Cluster 1")
Two clusters were apparent. We then cut the tree to expore them more.
rnaseq.clusters.cluster1 <- cut(rnaseq.clusters$lower[[1]], h = 1.5)
rnaseq.clusters.cluster1$lower
[[1]]
'dendrogram' with 2 branches and 891 members total, at height 1.27229
[[2]]
'dendrogram' with 2 branches and 48 members total, at height 1.374514
The second cluster did not need to be expored more here, but the first one certainly did.
plot(rnaseq.clusters.cluster1$lower[[1]], leaflab = "none", main = "RNAseq Cluster 1 Subcluster 1")
Next we added color.
plot(color_branches(rnaseq.clusters.cluster1$lower[[1]], k=2),leaflab="none", main = "RNAseq Cluster 1 Subcluster 1 with color")
Next was a cut to extract these two clusters.
rnaseq.clusters.cluster1.clusterAgain <- cut(rnaseq.clusters.cluster1$lower[[1]], h = 1.2)
rnaseq.clusters.cluster1.clusterAgain$lower
[[1]]
'dendrogram' with 2 branches and 545 members total, at height 1.095133
[[2]]
'dendrogram' with 2 branches and 346 members total, at height 1.117239
That is sufficient for cluster 1. We next moved to cluster 2.
plot(rnaseq.clusters$lower[[2]], leaflab = "none", main = "RNAseq Cluster 2")
There appeared to be a fair amount of clusters. We added color for different k values to see which one fit the best.
plot(color_branches(rnaseq.clusters$lower[[2]], k=4),leaflab="none", main = "RNAseq Cluster 2 with color")
None of these clusters needed to be explored further. We did extract them however.
rnaseq.clusters.cluster2 <- cut(rnaseq.clusters$lower[[2]], h = 1.6)
rnaseq.clusters.cluster2$lower
[[1]]
'dendrogram' with 2 branches and 234 members total, at height 1.48567
[[2]]
'dendrogram' with 2 branches and 188 members total, at height 1.486038
[[3]]
'dendrogram' with 2 branches and 491 members total, at height 1.396995
[[4]]
'dendrogram' with 2 branches and 401 members total, at height 1.567286
The third cluster was next.
plot(rnaseq.clusters$lower[[3]], leaflab = "none", main = "RNAseq Cluster 3")
In the above plot we saw 3 large clusters, so we added color to get a visual.
plot(color_branches(rnaseq.clusters$lower[[3]], k=3),leaflab="none", main = "RNAseq Cluster 3 with color")
We saw the first cluster (pink) needed to be explored further, so we cut the tree to extract it.
rnaseq.clusters.cluster3 <- cut(rnaseq.clusters$lower[[3]], h = 1.3)
rnaseq.clusters.cluster3$lower
[[1]]
'dendrogram' with 2 branches and 1507 members total, at height 1.229926
[[2]]
'dendrogram' with 2 branches and 24 members total, at height 1.224974
[[3]]
'dendrogram' with 2 branches and 558 members total, at height 1.268239
plot(rnaseq.clusters.cluster3$lower[[1]], leaflab = "none", main = "RNAseq Cluster 3 Subcluster 1")
Four main clusters popped out to us.
plot(color_branches(rnaseq.clusters.cluster3$lower[[1]], k=5),leaflab="none", main = "RNAseq Cluster 3 Subcluster1 with color")
rnaseq.clusters.cluster3.cutAgain <- cut(rnaseq.clusters.cluster3$lower[[1]], h = .85)
rnaseq.clusters.cluster3.cutAgain$lower
[[1]]
'dendrogram' with 2 branches and 265 members total, at height 0.8318486
[[2]]
'dendrogram' with 2 branches and 567 members total, at height 0.6561431
[[3]]
'dendrogram' with 2 branches and 47 members total, at height 0.7837043
[[4]]
'dendrogram' with 2 branches and 254 members total, at height 0.7389783
[[5]]
'dendrogram' with 2 branches and 374 members total, at height 0.7845916
Lastly, we looked at the fourth cluster.
plot(rnaseq.clusters$lower[[4]], leaflab = "none", main = "RNAseq Cluster 4")
plot(color_branches(rnaseq.clusters$lower[[4]], k=5),leaflab="none", main = "RNAseq Cluster 4 with color")
rnaseq.clusters.cluster4 <- cut(rnaseq.clusters$lower[[4]], h = 1.48)
rnaseq.clusters.cluster4$lower
[[1]]
'dendrogram' with 2 branches and 23 members total, at height 1.135677
[[2]]
'dendrogram' with 2 branches and 528 members total, at height 1.390056
[[3]]
'dendrogram' with 2 branches and 24 members total, at height 1.282748
[[4]]
'dendrogram' with 2 branches and 843 members total, at height 1.16056
[[5]]
'dendrogram' with 2 branches and 431 members total, at height 1.386127
Cluster 1
rnaseq.clusters.cluster1.clusterAgain\(lower[[1]] rnaseq.clusters.cluster1.clusterAgain\)lower[[2]]
Cluster 2
rnaseq.clusters.cluster2\(lower[[1]] rnaseq.clusters.cluster2\)lower[[2]] rnaseq.clusters.cluster2\(lower[[3]] rnaseq.clusters.cluster2\)lower[[4]]
Cluster 3
rnaseq.clusters.cluster3.cutAgain\(lower[[1]] rnaseq.clusters.cluster3.cutAgain\)lower[[2]] rnaseq.clusters.cluster3.cutAgain\(lower[[3]] rnaseq.clusters.cluster3.cutAgain\)lower[[4]] rnaseq.clusters.cluster3.cutAgain\(lower[[5]] rnaseq.clusters.cluster3\)lower[[2]] rnaseq.clusters.cluster3$lower[[3]]
Cluster 4
rnaseq.clusters.cluster4\(lower[[1]] rnaseq.clusters.cluster4\)lower[[2]] rnaseq.clusters.cluster4\(lower[[3]] rnaseq.clusters.cluster4\)lower[[4]] rnaseq.clusters.cluster4$lower[[5]]
The rnaSeq data seemed to produce more clusters than the microarray data. Specifically, smaller, more tightly bound clusters were found. This we believe is directly linked to the higher sampling rate used to record the rnaSeq data.
Using our “elbow” anaylsis, we determined that a small number of clusters is probably more meaningful than a large number. Specifically, we decided to do k-means analysis with k = 3, 6, and 8, because for more than 8 clusters, the elbow curve we generated started leveling out, meaning the added benefits (increased similarity between data points in a cluster) were minimal. The goal was to minimize the number of clusters while maximizes their internal similarity.
Below we carry out k-means clustering on both the microarray and rnaseq data, using k = 3, 6, and 8. For each k value, we also provide dendrograms to compare to hierarchical clustering results, and sample heatmaps for some of the clusters produced.
k=3
kuang.kmedoids <- pam(kuang.dist, k) # create k-medoids clustering with k clusters
kclusters <- kuang.kmedoids$cluster
table(kclusters)
kclusters
1 2 3
2534 1593 2114
Using the microarray data, k-means clustering with k=3 resulted in clusters of sizes 2534, 1593, and 2114. In the Kuang paper, they chose 3 main clusters of 537, 1608, and 980. This suggests that a larger k may produce smaller, more significant clusters.
# reorder genes so they match the dendrogram
kclusters <- kclusters[order.dendrogram(kuang.dend)]
# get branch colors so we're using the same palette
dend.colors <- unique(get_leaves_branches_attr(color_branches(kuang.dend, k=k), attr="col"))
# color the branches according to their k-means cluster assignment
plot(branches_attr_by_clusters(kuang.dend, kclusters, dend.colors),leaflab="none")
The figure above is the dendrogram with the k-means’ clusters superimposed using color as an indicator. If the k-means method matched the hierarchical method perfectly the three colors would be partitioned perfectly. This is not the case. However, the colors do clump together, which suggests the two methods produce similar results. Going forward, in each of the Compare to hierarchical clustering results sections please note that the more clumped the given colors are the closer the k-means results matches the hierarchical.
kuang_t.unique = subset(kuang_t.time, select=c(unique.genes, "time"))
kuang_t.long = gather(kuang_t.unique, gene, expression, -time)
#head(kuang_t)
#head(kuang_t.long)
#head(kuang_t.unique)
color.scheme <- rev(brewer.pal(8,"RdBu")) # generate the color scheme to use
kcluster1.genes <- names(kclusters[kclusters == 1])
kuang_t.long %>%
filter(gene %in% kcluster1.genes) %>%
ggplot(aes(x = time, y = gene)) +
geom_tile(aes(fill = expression)) +
scale_fill_gradientn(colors=color.scheme, limits=c(-2,2)) +
labs(title = "K-medoids (k=3) Clustering of Kuang et al. Data\nCluster 1") +
theme(axis.text.y = element_text(size = 2)) # set size of y axis labels
This heat map shows gene expression over time. Again, we can clearly see oscillating expression (red, then red, then blue). It appears the pattern repeats three time.
kcluster2.genes <- names(kclusters[kclusters == 2])
kuang_t.long %>%
filter(gene %in% kcluster2.genes) %>%
ggplot(aes(x = time, y = gene)) +
geom_tile(aes(fill = expression)) +
scale_fill_gradientn(colors=color.scheme, limits=c(-2,2)) +
labs(title = "K-medoids (k=3) Clustering of Kuang et al. Data\nCluster 2") +
theme(axis.text.y = element_text(size = 6)) # set size of y axis labels
This shows the expression of the second cluster. Again there is repeated expression patterns. Also, this cluster starts with negative correlation, suggesting a delay from the first cluster. However, the expression across genes for a given time is not greatly correlated. Thus, it is clearly apparent k needs to be increased.
kcluster3.genes <- names(kclusters[kclusters == 3])
kuang_t.long %>%
filter(gene %in% kcluster3.genes) %>%
ggplot(aes(x = time, y = gene)) +
geom_tile(aes(fill = expression)) +
scale_fill_gradientn(colors=color.scheme, limits=c(-2,2)) +
labs(title = "K-medoids (k=3) Clustering of Kuang et al. Data\nCluster 3") +
theme(axis.text.y = element_text(size = 6)) # set size of y axis labels
To repeat the heatmap did not look “clean,” and thus k needs to be increased.
k=6
kuang.kmedoids <- pam(kuang.dist, k) # create k-medoids clustering with k clusters
kclusters <- kuang.kmedoids$cluster
table(kclusters)
kclusters
1 2 3 4 5 6
769 872 585 1372 977 1666
Using the microarray data, k-means clustering with k=6 resulted in clusters of sizes 769, 872, 585, 1372, 977 and 1666. These sizes are more similar in size to the 3 main clusters that the Kuang paper picked.
# reorder genes so they match the dendrogram
kclusters <- kclusters[order.dendrogram(kuang.dend)]
# get branch colors so we're using the same palette
dend.colors <- unique(get_leaves_branches_attr(color_branches(kuang.dend, k=k), attr="col"))
# color the branches according to their k-means cluster assignment
plot(branches_attr_by_clusters(kuang.dend, kclusters, dend.colors),leaflab="none")
kcluster1.genes <- names(kclusters[kclusters == 1])
kuang_t.long %>%
filter(gene %in% kcluster1.genes) %>%
ggplot(aes(x = time, y = gene)) +
geom_tile(aes(fill = expression)) +
scale_fill_gradientn(colors=color.scheme, limits=c(-2,2)) +
labs(title = "K-medoids (k=6) Clustering of Kuang et al. Data\nCluster 1") +
theme(axis.text.y = element_text(size = 6)) # set size of y axis labels
k=8
kuang.kmedoids <- pam(kuang.dist, k) # create k-medoids clustering with k clusters
kclusters <- kuang.kmedoids$cluster
table(kclusters)
kclusters
1 2 3 4 5 6 7 8
781 597 469 1094 952 598 932 818
Using the microarray data, k-means clustering with k=8 resulted in clusters of sizes 781, 597, 469, 1094, 932 and 818. These sizes are also similar in size to the 3 main clusters that the Kuang paper picked.
# reorder genes so they match the dendrogram
kclusters <- kclusters[order.dendrogram(kuang.dend)]
# get branch colors so we're using the same palette
dend.colors <- unique(get_leaves_branches_attr(color_branches(kuang.dend, k=k), attr="col"))
# color the branches according to their k-means cluster assignment
plot(branches_attr_by_clusters(kuang.dend, kclusters, dend.colors),leaflab="none")
kcluster1.genes <- names(kclusters[kclusters == 1])
kuang_t.long %>%
filter(gene %in% kcluster1.genes) %>%
ggplot(aes(x = time, y = gene)) +
geom_tile(aes(fill = expression)) +
scale_fill_gradientn(colors=color.scheme, limits=c(-2,2)) +
labs(title = "K-medoids (k=8) Clustering of Kuang et al. Data\nCluster 1") +
theme(axis.text.y = element_text(size = 6)) # set size of y axis labels
We wanted to highlight that as k increases the vibrancy of the vertical expression lines in the heat maps increased, as expected.
We repeated the heat map analysis for the rnaSeq data
k=3
rnaseq.kmedoids <- pam(rnaseq.dist, k) # create k-medoids clustering with k clusters
kclusters <- rnaseq.kmedoids$cluster
table(kclusters)
kclusters
1 2 3
1620 2549 2022
Using the rnaseq data, k-means clustering with k=3 resulted in clusters of sizes 1620, 2549, 2022. These are similar in size to those in the microarray k-means analysis.
# reorder genes so they match the dendrogram
kclusters <- kclusters[order.dendrogram(rnaseq.dend)]
# get branch colors so we're using the same palette
dend.colors <- unique(get_leaves_branches_attr(color_branches(rnaseq.dend, k=k), attr="col"))
# color the branches according to their k-means cluster assignment
plot(branches_attr_by_clusters(rnaseq.dend, kclusters, dend.colors),leaflab="none")
rnaseq.unique.genes <- colnames(rnaseq_t) %>% unique()
rnaseq_t.unique = subset(rnaseq_t.time, select=c(rnaseq.unique.genes, "time"))
rnaseq_t.long = gather(rnaseq_t.unique, gene, expression, -time)
color.scheme <- rev(brewer.pal(8,"RdBu")) # generate the color scheme to use
kcluster1.genes <- names(kclusters[kclusters == 1])
rnaseq_t.long %>%
filter(gene %in% kcluster1.genes) %>%
ggplot(aes(x = time, y = gene)) +
geom_tile(aes(fill = expression)) +
scale_fill_gradientn(colors=color.scheme, limits=c(-2,2)) +
labs(title = "K-medoids (k=1) Clustering of Rnaseq Data\nCluster 1") +
theme(axis.text.y = element_text(size = 6)) # set size of y axis labels
One note-worthy occurence in performing a heat map on the rnaSeq data is the visible gaps. This is because the time data we use is given in discrete time intervals, but the heat map function assumes the input data in given in a continuous spectrum. This is true for the microarray data too, but the time intervals are closer together so the heatmaps look less sparse. The discrepancy in gap size (some are larger than others) indicate that certain time intervals were further apart than others.
head(rnaseq_t.long)
kcluster2.genes <- names(kclusters[kclusters == 2])
rnaseq_t.long %>%
filter(gene %in% kcluster2.genes) %>%
ggplot(aes(x = time, y = gene)) +
geom_tile(aes(fill = expression)) +
scale_fill_gradientn(colors=color.scheme, limits=c(-2,2)) +
labs(title = "K-medoids (k=2) Clustering of Rnaseq Data\nCluster 2") +
theme(axis.text.y = element_text(size = 6)) # set size of y axis labels
kcluster3.genes <- names(kclusters[kclusters == 3])
rnaseq_t.long %>%
filter(gene %in% kcluster3.genes) %>%
ggplot(aes(x = time, y = gene)) +
geom_tile(aes(fill = expression)) +
scale_fill_gradientn(colors=color.scheme, limits=c(-2,2)) +
labs(title = "K-medoids (k=1) Clustering of Rnaseq Data\nCluster 3") +
theme(axis.text.y = element_text(size = 6)) # set size of y axis labels
k=6
rnaseq.kmedoids <- pam(rnaseq.dist, k) # create k-medoids clustering with k clusters
kclusters <- rnaseq.kmedoids$cluster
table(kclusters)
kclusters
1 2 3 4 5 6
820 481 1787 1145 1048 910
Using the rnaseq data, k-means clustering with k=6 resulted in clusters of sizes 820, 481, 1787, 1145, 1048, and 910. These are similar in size to those in the microarray k-means analysis.
# reorder genes so they match the dendrogram
kclusters <- kclusters[order.dendrogram(rnaseq.dend)]
# get branch colors so we're using the same palette
dend.colors <- unique(get_leaves_branches_attr(color_branches(rnaseq.dend, k=k), attr="col"))
# color the branches according to their k-means cluster assignment
plot(branches_attr_by_clusters(rnaseq.dend, kclusters, dend.colors),leaflab="none")
kcluster1.genes <- names(kclusters[kclusters == 1])
rnaseq_t.long %>%
filter(gene %in% kcluster1.genes) %>%
ggplot(aes(x = time, y = gene)) +
geom_tile(aes(fill = expression)) +
scale_fill_gradientn(colors=color.scheme, limits=c(-2,2)) +
labs(title = "K-medoids (k=6) Clustering of Rnaseq Data\nCluster 1") +
theme(axis.text.y = element_text(size = 6)) # set size of y axis labels
k=8
rnaseq.kmedoids <- pam(rnaseq.dist, k) # create k-medoids clustering with k clusters
kclusters <- rnaseq.kmedoids$cluster
table(kclusters)
kclusters
1 2 3 4 5 6 7 8
935 443 1198 836 718 668 627 766
Using the rnaseq data, k-means clustering with k=8 resulted in clusters of sizes 935, 443, 1198, 836, 718, 668, 627, 766. These are similar in size to those in the microarray k-means analysis.
# reorder genes so they match the dendrogram
kclusters <- kclusters[order.dendrogram(rnaseq.dend)]
# get branch colors so we're using the same palette
dend.colors <- unique(get_leaves_branches_attr(color_branches(rnaseq.dend, k=k), attr="col"))
# color the branches according to their k-means cluster assignment
plot(branches_attr_by_clusters(rnaseq.dend, kclusters, dend.colors),leaflab="none")
kcluster1.genes <- names(kclusters[kclusters == 1])
rnaseq_t.long %>%
filter(gene %in% kcluster1.genes) %>%
ggplot(aes(x = time, y = gene)) +
geom_tile(aes(fill = expression)) +
scale_fill_gradientn(colors=color.scheme, limits=c(-2,2)) +
labs(title = "K-medoids (k=8) Clustering of Rnaseq Data\nCluster 1") +
theme(axis.text.y = element_text(size = 6)) # set size of y axis labels
We performed complete linkage heirarchial cluster and K-means clustering for this assignment. The k-means and heirarchial clustering data produce similar result as shown by the figures that map k-means onto heirarchial clustering. We decided to use heirarchial clustering because we were used to breaking clusters into subclusters with this method.
We used 1-correlation as the measure of dissimilarity for the hierarchial clustering.
To determine the cut height, we looked how different number of clusters changed the dendogram (as shown in the Clustering section). We determined that 3 main clusters best fit the data. Then, we broke these 3 clusters into separate subclusters. Then we produced heat maps and line plots to see that the genes had similar expression over time.
We chose 3, 6, and 8 as the k for k-means. The reason we chose these values was described in the K-means clustering section.
Both forms of clustering worked well for our data. We chose to explore the hierarchial clustering more because we were more used to that form of analysis.
Described earlier in the assignment. Large distributions for 3 clusters. Broke those large clusters into subclusters of fewer genes.
Our findings were similar to the Kuang paper. We found 3 large clusters that behaved with similar oscillatory patterns but with different phases, suggesting a delay between the clusters’ oscillatory patterns. This supports the claim of the paper that there is a just in time supply chain in the function of the metabolic cycle. We broke the 3 large clusters into 10 subclusters and found that genes within a subcluster behaved with even more similar oscillatory patterns.
Dendogram (line 180), Heat Maps (lines 483-524), Separate Cluster Line Plot (lines 586-601)